Comparing k-means clusters on parallel Persian-English corpus
Abstract
This paper compares clusters of aligned Persian and English texts obtained with the k-means method. Text clustering has many applications in various fields of natural language processing. So far, much research has been done on clustering English documents. This raises a question: can those results be extended to other languages? Since the goal of document clustering is to group documents by their content, the answer is expected to be yes. On the other hand, the many differences between languages could make the answer no. This research focuses on k-means, one of the basic and popular document clustering methods. We want to know whether the clusters of aligned Persian and English texts obtained by k-means are similar. To answer this question, the Mizan English-Persian parallel corpus was used as a benchmark. After feature extraction using text mining techniques and dimension reduction with PCA, k-means clustering was performed. The morphological differences between English and Persian led to a longer feature vector for Persian, so in almost all experiments the English results were slightly richer than the Persian ones. Aside from these differences, the overall behavior of the Persian and English clusters was similar. This similarity indicates that results of k-means research on English can be extended to Persian. Finally, there is hope that, despite the many differences between languages, clustering methods may be extendable to other languages.
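The abstract does not say how the similarity of the Persian and English clusters was measured. As a minimal sketch of one possible approach, the Rand index can score how often two clusterings of the same aligned documents agree on whether a pair of documents belongs together. The function name `rand_index` and the label lists are hypothetical illustrations, not data or methods from the paper.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of document pairs on which two clusterings agree.

    A pair counts as an agreement when both clusterings put the two
    documents in the same cluster, or both put them in different clusters.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both label lists must cover the same aligned documents")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two documents")
    agreements = 0
    for i, j in pairs:
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a == same_b:
            agreements += 1
    return agreements / len(pairs)


# Hypothetical cluster labels for six aligned documents (not from the paper).
# Position i in both lists refers to the same aligned English/Persian text.
english_labels = [0, 0, 1, 1, 2, 2]
persian_labels = [1, 1, 0, 0, 2, 0]

score = rand_index(english_labels, persian_labels)
print(f"Rand index: {score:.3f}")
```

Because the measure only looks at pairs, it does not matter that k-means numbers its clusters arbitrarily in each language. A value of 1.0 means the two clusterings partition the aligned documents identically.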
Similar resources
MIZAN: A Large Persian-English Parallel Corpus
One of the most major and essential tasks in natural language processing is machine translation that is now highly dependent upon multilingual parallel corpora. Through this paper, we introduce the biggest Persian-English parallel corpus with more than one million sentence pairs collected from masterpieces of literature. We also present acquisition process and statistics of the corpus, and expe...
TEP: Tehran English-Persian Parallel Corpus
Parallel corpora are one of the key resources in natural language processing. In spite of their importance in many multi-lingual applications, no large-scale English-Persian corpus has been made available so far, given the difficulties in its creation and the intensive labors required. In this paper, the construction process of Tehran English-Persian parallel corpus (TEP) using movie subtitles,...
PEN: Parallel English-Persian News Corpus
Parallel corpora are the necessary resources in many multilingual natural language processing applications, including machine translation and cross-lingual information retrieval. Manual preparation of a large scale parallel corpus is a very time consuming and costly procedure. In this paper, the work towards building a sentence-level aligned English-Persian corpus in a semi-automated manner is p...
Extracting an English-Persian Parallel Corpus from Comparable Corpora
Parallel data are an important part of a reliable Statistical Machine Translation (SMT) system. The more of these data are available, the better the quality of the SMT system. However, for some language pairs such as Persian-English, parallel sources of this kind are scarce. In this paper, a bidirectional method is proposed to extract parallel sentences from English and Persian document aligned...
Creating a Persian-English Comparable Corpus
Multilingual corpora are valuable resources for cross-language information retrieval and are available in many language pairs. However the Persian language does not have rich multilingual resources due to some of its special features and difficulties in constructing the corpora. In this study, we build a Persian-English comparable corpus from two independent news collections: BBC News in Englis...
Journal title: Journal of AI and Data Mining
Publisher: Shahrood University of Technology
ISSN 2322-5211
Volume 3
Issue 2, 2015